List of Flash News about model security
| Time | Details |
|---|---|
| 2026-01-09 21:30 | Anthropic Reports Classifiers Cut Claude Jailbreak Rate from 86% to 4.4% but Increase Costs and Benign Refusals; Two Attack Vectors Remain. According to @AnthropicAI, internal classifiers reduced the Claude jailbreak success rate from 86% to 4.4%, a drop of 81.6 percentage points (roughly a 95% relative reduction). The same post states that the classifiers were expensive to run, raising operating costs for deployments, and that the system became more likely to refuse benign requests once they were added. Despite the improvement, the system remained vulnerable to two types of attacks shown in the post's accompanying figure. Source: @AnthropicAI on X, Jan 9, 2026, https://twitter.com/AnthropicAI/status/2009739654833029304 |